---
layout: page
title: Lab 5 - Panel Data
permalink: /labs/lab5/
parent: Labs
nav_order: 5
---
Using panel data helps us partly circumvent the issue of omitted
variable bias: rather than controlling for every potential
time-invariant confounder, we can simply include fixed effects that
capture all of these confounders - whether observed or unobserved.
We also know that bias can come from factors that change over time
but affect all units equally. To minimise bias from these factors, we can include in
our model time-effect controls for any common impact that affects all
units. Put differently, time-fixed effects capture time-specific factors
that affect all entities at the same point in time but vary over
time.
A model with “unit-fixed effects” is a model to which we have added a dummy variable for every unit (county, individual, or state). By adding these to our model, we control for all differences between units that do not change over time. Unit fixed effects capture factors that vary from one unit to another but remain constant over time - this includes all time-invariant unobserved (and, importantly, unobservable) heterogeneity. Put differently, entity or unit-fixed effects are factors that are constant over time but vary across entities.
The coefficient of each dummy variable shifts the intercept, but does not affect the slope.
\[Y_{i,t} = \beta_1 X_{i,t} + \alpha_i + \mu_{i,t}\]
We can combine both time and unit fixed effects in the same model. These models are referred to as two-way fixed effects models. By including both sets of fixed effects, we reduce bias that comes from unobserved unit-specific factors that are constant over time as well as from common shocks that affect all units in a given period.
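In equation form, the two-way model adds a time-specific intercept \(\gamma_t\) to the unit-fixed-effects specification above:

\[Y_{i,t} = \beta_1 X_{i,t} + \alpha_i + \gamma_t + \mu_{i,t}\]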
Using panel data we cannot assume that regression errors are
independent of each other. Indeed, errors are likely correlated
within the same unit or the same year. Therefore, we need to cluster
standard errors to adjust/account for this correlation. Using
clustered standard errors, we allow for heteroskedasticity and
autocorrelation within an entity, but treat the errors as uncorrelated
across entities.
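As a preview of what this looks like in practice, here is a sketch of clustering standard errors by county using the `lmtest` and `sandwich` packages (it assumes the `weather` data frame that we load later in this lab, with `FIPS_County` as the county identifier):

```r
# Sketch: cluster standard errors by county for a fitted lm() model
library(lmtest)
library(sandwich)

fit <- lm(GOPVoteShare ~ Rain, data = weather)

# vcovCL() computes a cluster-robust variance-covariance matrix;
# coeftest() re-computes the coefficient table using it
coeftest(fit, vcov = vcovCL(fit, cluster = ~ FIPS_County))
```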
Before starting this seminar
Create a folder called “lab5”
Download the data (you can use this button or the one at the top, or read the files directly from GitHub):
Open an R script (or Markdown file) and save it in our “lab5” folder.
Set your working directory using the setwd() function or by clicking on “More”. For example setwd("~/Desktop/Causal Inference/2024/Lab5")
Let’s install and load the packages that we will be using in this lab:
library(jtools) # generate plots with regression coefficients
library(stargazer) # generate formatted regression tables
library(texreg) # generate formatted regression tables
library(tidyverse) # to conduct some tidy operations
library(ggplot2) # to generate plots
library(plm) # conduct one-way and two-way fixed effects
library(estimatr) # to conduct OLS with robust standard errors
library(lmtest) # calculates the variance-covariance matrix to be clustered by group
library(sandwich) # heteroskedasticity-consistent covariance matrix estimation
library(haven) # load .dta files
In this seminar, we will cover the following topics:
1. Conduct a time fixed effects model using the lm() function
2. Conduct a time and unit fixed effects model using the lm() function
3. Conduct a time fixed effects model using the plm() function
4. Conduct a time and unit fixed effects model using the plm() function
5. Correct standard errors obtained from the plm() function
6. Conduct a Hausman test and F-test
Today, we will work with data from Gomez et al.’s paper The Republicans Should Pray for Rain: Weather, Turnout, and Voting in U.S. Presidential Elections. In this paper, Gomez et al. seek to answer the following question: Does bad weather affect turnout? To address this question, they looked at 14 U.S. presidential elections. Using GIS interpolations and weather data from 22,000 U.S. weather stations, they provide election-day estimates of rain and snow at the county level.
The authors find that rain significantly reduces voter participation
by a rate of just less than 1% per inch. Meanwhile, an inch of snowfall
decreases turnout by almost .5%. The authors also find that poor weather
benefits Republicans. The authors’ main outcome of interest is vote
turnout at the county level.
Their data consists of over 3,000 counties in the continental United
States for each presidential election from 1948 to 2000. The
meteorological variables were drawn from the National Climatic Data
Center that contained Day data report of various measures of the day’s
weather, including rainfall and snow, for over 20,000 weather stations.
In total the authors collected up to 43,340 observations (N = 3,115,
max, and T = 14 elections).
The main “treatment” variables of this study are estimated rainfall and snowfall measured in inches (1 in = 2.54 cm). The authors used two alternative measures of weather. First, they calculated the normal (average) rainfall and snowfall for each election date for each county, using data from the 1948-2000 time span. Then, they subtracted the appropriate daily normal value from the rainfall or snowfall estimated to have occurred on each election day under analysis.
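In equation form, the deviation measure for rainfall is simply (writing \(\text{NormalRain}\) for the county's 1948-2000 daily normal, a label used here for illustration; the construction for Snow_Dev is analogous):

\[\text{Rain\_Dev}_{i,t} = \text{Rain}_{i,t} - \text{NormalRain}_{i,t}\]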
The authors control for a number of socio-economic factors that are associated with voter turnout such as High School Graduates in the county, median household income, and how rural the county is.
The authors find that, when measured as deviations from their normal
values, rain and snow have a negative and statistically significant
effect on voter turnout. Thus, the evidence supports the claim that bad
weather lowers voter turnout. The authors also find that when rain and
snow increase above their respective election-day normals, the vote share
of Republican presidential candidates increases.
A description of some of the variables that we will use today is below:
| Variable | Description |
|---|---|
| Year | Election year |
| State | State name |
| County | County name |
| FIPS_County | County id |
| Turnout | Turnout in the presidential election as a percentage of the voting-age population |
| Rain | GIS rainfall estimate |
| Snow | GIS snowfall estimate |
| Rain_Dev | Deviation from normal rainfall |
| Snow_Dev | Deviation from normal snowfall |
| GOPVoteShare | Republican candidate’s vote share |
| PcntBlack | Estimated percentage of Black population |
| ZPcntHSGrad | Estimated percentage of high school graduates, standardised by year at the county level |
| FarmsPerCap | Farms per capita |
| Unemploy | Estimated unemployment rate |
| AdjIncome | CPI-adjusted median household income (1982-84 = 100), in 10,000s |
| Closing | Length of time between the date a voter must register and election day |
| Literacy | Literacy test |
| PollTax | Poll tax |
| Yr52 | Dummy for the 1952 election |
| Yr56 | Dummy for the 1956 election |
| Yr.. | Dummy for election … |
| Yr2000 | Dummy for the 2000 election |
Now let’s load the data. There are two ways to do this:
You can load the dataset from your laptop using the
read_dta() function. We will call this data frame
weather.
# Set your working directory
#setwd("~/Desktop/Causal Inference/2024/Lab5")
#
#setwd("/cloud/project")

# Load the data and call the data frame "weather"
weather <- read_dta("Weather_publicfile.dta")
Or you can download the data from the course website at the following URL: https://widening-sarajevo.github.io/CI/data/Weather_publicfile.dta.
Let us start with a few descriptive statistics to become familiar
with the data. First, as always, we use the head() function
to get an impression of the data
head(weather)
Year State County FIPS_County Turnout Turnout_Lag GOPVoteShare PcntBlack
1 1948 ALABAMA AUTAUGA 1001 12.87726 13.06288 8.553655 47.00900
2 1952 ALABAMA AUTAUGA 1001 23.53037 12.87726 34.098790 45.16853
3 1956 ALABAMA AUTAUGA 1001 23.08282 23.53037 37.472670 43.61412
4 1960 ALABAMA AUTAUGA 1001 26.49146 23.08282 43.342130 42.09053
5 1964 ALABAMA AUTAUGA 1001 29.25427 26.49146 85.809250 33.70723
6 1968 ALABAMA AUTAUGA 1001 57.02074 29.25427 7.787201 27.56027
ZPcntHSGrad FarmsPerCap Unemploy AdjIncome Closing Property Literacy PollTax
1 -1.0891030 0.0882406 5.642747 0.3612264 10 0 1 1
2 -0.9176651 0.0861690 3.272751 0.6096017 10 0 0 1
3 -0.8140256 0.0535593 3.636375 0.8677695 10 0 0 1
4 -0.7632608 0.0217194 4.000000 1.1244930 10 0 0 1
5 -0.4192493 0.0221520 3.480000 1.6020970 10 0 0 0
6 -0.1011990 0.0224998 2.960000 1.8978450 10 0 0 0
Motor GubElection SenElection Yr52 Yr56 Yr60 Yr64 Yr68 Yr72 Yr76 Yr80 Yr84
1 0 0 1 0 0 0 0 0 0 0 0 0
2 0 0 0 1 0 0 0 0 0 0 0 0
3 0 0 1 0 1 0 0 0 0 0 0 0
4 0 0 1 0 0 1 0 0 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 0 0
6 0 0 1 0 0 0 0 1 0 0 0 0
Yr88 Yr92 Yr96 Yr2000 Rain Snow GOPVoteShare_3MA RD_GOPVS SD_GOPVS
1 0 0 0 0 0.23 0 6.509238 0.6533802 -0.0491263
2 0 0 0 0 0.00 0 7.620456 -0.8238719 -0.0468951
3 0 0 0 0 0.00 0 17.076720 -1.5852350 -0.0952355
4 0 0 0 0 0.00 0 26.708370 -2.9429600 -0.1461401
5 0 0 0 0 0.00 0 38.304530 -2.5150900 -0.2312726
6 0 0 0 0 0.00 0 55.541350 -8.7608620 -0.3877415
Rain_Dev Snow_Dev
1 0.1003774 -0.0075472
2 -0.1081132 -0.0061538
3 -0.0928302 -0.0055769
4 -0.1101887 -0.0054717
5 -0.0656604 -0.0060377
6 -0.1577359 -0.0069811
Now, let’s have a look at the number of elections that are included
in the data set. We know that the authors are looking at presidential
elections, so each unique year in the data corresponds to exactly one
presidential election. We use the unique() command to
retrieve the unique values of the Year variable. Then, you can
use the length() command to display the number of unique
values.
# Retrieving unique values of the variable
unique(weather$Year)
[1] 1948 1952 1956 1960 1964 1968 1972 1976 1980 1984 1988 1992 1996 2000
# Retrieving the number of the unique values of the variable.
length(unique(weather$Year))
[1] 14
So the data set covers 14 different presidential elections - as the
unique() command shows us, these are all elections that
took place between 1948 and 2000. Let’s do the same now for the number
of states.
Exercise 1: How many US states are covered by the data -
which ones are missing?
# Retrieving the number of the unique values of the variable.
length(unique(weather$State))
[1] 48
# Retrieving unique values of the variable
unique(weather$State)
[1] "ALABAMA" "ARIZONA" "ARKANSAS" "CALIFORNIA"
[5] "COLORADO" "CONNECTICUT" "DELAWARE" "FLORIDA"
[9] "GEORGIA" "IDAHO" "ILLINOIS" "INDIANA"
[13] "IOWA" "KANSAS" "KENTUCKY" "LOUISIANA"
[17] "MAINE" "MARYLAND" "MASSACHUSETTS" "MICHIGAN"
[21] "MINNESOTA" "MISSISSIPPI" "MISSOURI" "MONTANA"
[25] "NEBRASKA" "NEVADA" "NEW HAMPSHIRE" "NEW JERSEY"
[29] "NEW MEXICO" "NEW YORK" "NORTH CAROLINA" "NORTH DAKOTA"
[33] "OHIO" "OKLAHOMA" "OREGON" "PENNSYLVANIA"
[37] "RHODE ISLAND" "SOUTH CAROLINA" "SOUTH DAKOTA" "TENNESSEE"
[41] "TEXAS" "UTAH" "VERMONT" "VIRGINIA"
[45] "WASHINGTON" "WEST VIRGINIA" "WISCONSIN" "WYOMING"
The data cover 48 US states, so two are missing. We can see that the
data cover all contiguous US states - the two missing ones are Alaska
and Hawaii. Is there a reason to exclude them?
There is! Alaska and Hawaii joined the Union only in 1959 - so the authors would have had shorter time-series for these two states.
Let’s start by paying attention to the temporal dimension of our voting data. Conventional wisdom has it that each and every election is different. Accordingly, we want to acknowledge these differences across elections and take them into account when estimating our models.
Let’s first have a look at the broad picture by plotting the outcome
of interest across time. We can use the boxplot() function
to do this. The syntax is the following:
boxplot(Y ~ X, data = data, xlab = "X Label", ylab = "Y Label").
Exercise 2: Plot GOPVoteShare over
time
boxplot(GOPVoteShare ~ Year,
data = weather,
xlab = "Election Year",
ylab = "GOP Vote Share",
las = 2) # This argument rotates the axis labels
The vote share cast for the GOP varies widely. Note that the data set
includes an observation for each county. The box indicates the
interquartile range (i.e. the difference between the third and first
quartiles of the data points) and the solid line the median value for each
year. There are meaningful differences: note how GOP vote share is
larger in 1972, when McGovern (ever heard of him?) had no chance against
Nixon. Importantly, we still see a rather broad range of GOP vote shares
- this is the variation introduced by individual counties and their
specifics. Considering only temporal differences here, we do not
distinguish any other effects.
Let’s now run a simple OLS model without considering any variation
introduced by temporal differences or unit (i.e. county) specifics. Such
a model is called pooled OLS, as we simply pool all observations
in our data set to estimate a coefficient. By design, all observations
are considered to be independent of each other in such a model.
Exercise 3: Regress GOPVoteShare on
Rain
Estimate two models: First, a simple bivariate model and second a
model that includes important covariates: PcntBlack,
ZPcntHSGrad, FarmsPerCap,
AdjIncome and Literacy. Interpret your model
estimates.
# Bivariate pooled OLS
ols_bivariate <- lm(GOPVoteShare ~ Rain, data=weather)
summary(ols_bivariate)
Call:
lm(formula = GOPVoteShare ~ Rain, data = weather)
Residuals:
Min 1Q Median 3Q Max
-50.845 -9.826 1.035 10.957 45.718
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.87158 0.07924 642.023 <2e-16 ***
Rain 0.50357 0.30494 1.651 0.0987 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.55 on 43338 degrees of freedom
Multiple R-squared: 6.292e-05, Adjusted R-squared: 3.985e-05
F-statistic: 2.727 on 1 and 43338 DF, p-value: 0.09867
# Multivariate pooled OLS
ols_covariates <- lm(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, data=weather)
summary(ols_covariates)
Call:
lm(formula = GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad +
FarmsPerCap + AdjIncome + Literacy, data = weather)
Residuals:
Min 1Q Median 3Q Max
-52.955 -9.581 0.249 9.806 69.800
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.75850 0.33286 137.469 < 2e-16 ***
Rain 1.08860 0.27941 3.896 9.79e-05 ***
PcntBlack -0.23050 0.00563 -40.940 < 2e-16 ***
ZPcntHSGrad 2.01974 0.08536 23.661 < 2e-16 ***
FarmsPerCap 43.40966 1.96753 22.063 < 2e-16 ***
AdjIncome 3.00524 0.13687 21.956 < 2e-16 ***
Literacy -3.97069 0.24021 -16.530 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.21 on 43333 degrees of freedom
Multiple R-squared: 0.1645, Adjusted R-squared: 0.1644
F-statistic: 1422 on 6 and 43333 DF, p-value: < 2.2e-16
The bivariate model estimates a - barely significant - association between rain and GOP vote share of about 0.5 percentage points per inch. As we know, however, such a bivariate association does not account for several other effects. Indeed, all covariates included in the multivariate pooled OLS are highly correlated with GOP vote share - and the estimate of rain on election day doubles to about 1 percentage point in the multivariate model.
Let’s now move away from the pooled - or naive - OLS model. We know that despite accounting for the covariates, the multivariate model does not account for variance from two important sources: (i) general temporal development, (ii) unit-specific time-invariant variance. In a first step, we account for the former. Let’s see how acknowledging general election effects changes our estimate.
In other words, we will now estimate a fixed-effects
model. You can use the lm() function to include
time-fixed effects. If you add the Year variable as a
factor variable, a dummy variable for each year is added to the
regression equation. In other words, we allow each presidential election
to have a different intercept but estimate a single slope for the
independent variables.
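In equation form, the time-fixed-effects model mirrors the unit-fixed-effects equation above, but with a time-specific intercept \(\gamma_t\) in place of the unit-specific one:

\[Y_{i,t} = \beta_1 X_{i,t} + \gamma_t + \mu_{i,t}\]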
Exercise 4: Estimate the same bivariate and multivariate models as above including time-fixed effects.
Note that this is not the same as simply adding the Year
variable as an independent variable. If you did the latter - with
Year being a continuous variable - R would simply
understand it to be a single, general, continuous time
trend.
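To see the difference, compare the two specifications side by side (a sketch, assuming the weather data frame loaded above):

```r
# Year as a continuous variable: a single linear time trend
trend_model <- lm(GOPVoteShare ~ Rain + Year, data = weather)

# Year as a factor: one dummy (separate intercept) per election year
fe_model <- lm(GOPVoteShare ~ Rain + factor(Year), data = weather)

length(coef(trend_model)) # intercept, Rain, one slope for Year
length(coef(fe_model))    # intercept, Rain, 13 year dummies
```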
# Bivariate time FE
FE_bivariate <- lm(GOPVoteShare ~ Rain + factor(Year), data=weather)
summary(FE_bivariate)
Call:
lm(formula = GOPVoteShare ~ Rain + factor(Year), data = weather)
Residuals:
Min 1Q Median 3Q Max
-52.403 -7.759 0.849 8.815 52.372
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.2568 0.2465 159.232 < 2e-16 ***
Rain -1.0548 0.2845 -3.708 0.000209 ***
factor(Year)1952 15.9501 0.3442 46.340 < 2e-16 ***
factor(Year)1956 14.4377 0.3411 42.322 < 2e-16 ***
factor(Year)1960 11.3954 0.3425 33.275 < 2e-16 ***
factor(Year)1964 4.9604 0.3438 14.428 < 2e-16 ***
factor(Year)1968 4.5954 0.3428 13.407 < 2e-16 ***
factor(Year)1972 27.4617 0.3407 80.594 < 2e-16 ***
factor(Year)1976 7.1658 0.3439 20.837 < 2e-16 ***
factor(Year)1980 14.3616 0.3436 41.794 < 2e-16 ***
factor(Year)1984 23.2201 0.3428 67.744 < 2e-16 ***
factor(Year)1988 16.7244 0.3424 48.841 < 2e-16 ***
factor(Year)1992 0.7715 0.3396 2.272 0.023110 *
factor(Year)1996 5.5399 0.3422 16.187 < 2e-16 ***
factor(Year)2000 17.9530 0.3406 52.704 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.36 on 43325 degrees of freedom
Multiple R-squared: 0.2621, Adjusted R-squared: 0.2618
F-statistic: 1099 on 14 and 43325 DF, p-value: < 2.2e-16
# Multivariate time FE
FE_covariates <- lm(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy + factor(Year), data=weather)
summary(FE_covariates)
Call:
lm(formula = GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad +
FarmsPerCap + AdjIncome + Literacy + factor(Year), data = weather)
Residuals:
Min 1Q Median 3Q Max
-49.622 -7.631 0.325 7.743 73.060
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.269963 0.344545 113.976 <2e-16 ***
Rain 0.114292 0.255610 0.447 0.6548
PcntBlack -0.238743 0.004769 -50.056 <2e-16 ***
ZPcntHSGrad 2.924763 0.087118 33.572 <2e-16 ***
FarmsPerCap 35.762215 1.720000 20.792 <2e-16 ***
AdjIncome 0.140178 0.191640 0.731 0.4645
Literacy -2.333959 0.213516 -10.931 <2e-16 ***
factor(Year)1952 16.099061 0.311004 51.765 <2e-16 ***
factor(Year)1956 15.075127 0.314584 47.921 <2e-16 ***
factor(Year)1960 12.700441 0.324737 39.110 <2e-16 ***
factor(Year)1964 5.925331 0.349124 16.972 <2e-16 ***
factor(Year)1968 4.925047 0.365851 13.462 <2e-16 ***
factor(Year)1972 26.974076 0.416674 64.737 <2e-16 ***
factor(Year)1976 7.240351 0.423082 17.113 <2e-16 ***
factor(Year)1980 14.700643 0.392609 37.443 <2e-16 ***
factor(Year)1984 23.479995 0.394124 59.575 <2e-16 ***
factor(Year)1988 16.906547 0.410397 41.196 <2e-16 ***
factor(Year)1992 0.952088 0.393233 2.421 0.0155 *
factor(Year)1996 6.036349 0.389940 15.480 <2e-16 ***
factor(Year)2000 18.317105 0.384949 47.583 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.99 on 43320 degrees of freedom
Multiple R-squared: 0.406, Adjusted R-squared: 0.4058
F-statistic: 1558 on 19 and 43320 DF, p-value: < 2.2e-16
Do you notice the vast difference in our estimated effect of
Rain? In the bivariate FE model, the estimated effect is
negative: one inch of rain would be associated with a roughly
1 percentage point decrease in GOP vote share. Once we control for the same set of
covariates as above, the model estimates that an increase of one unit
of Rain is not associated with GOP vote share at all!
Let’s pause a moment to make sure we grasp the difference between a
pooled OLS and the FE estimator. Simply speaking, the pooled OLS
considers each observation but does not differentiate them in any
particular way. Thus, the bivariate model simply estimates the
mathematical association between the two variables, using all data
points:
require(modelr)
require(ggplot2)
grid <- data.frame(Intercept=1, Rain=seq_range(weather$Rain, 10))
grid$pred <- predict(ols_bivariate,grid)
ggplot(weather, aes(Rain)) +
geom_point(aes(y = GOPVoteShare)) +
geom_line(aes(y = pred), data = grid, colour = "red", size = 1)
The logic of the FE estimator is different, however. This estimator
looks at and compares data points within a certain period. It
estimates a single coefficient (or slope) for independent variables, but
allows each period to have an individual intercept, thus capturing
general time trends that affect all units. Let’s look at the coefficient
of Rain in the bivariate FE model now:
weather$fYear <- factor(weather$Year) # Creating a factor variable makes it easier to plot
grid_fixdum <- expand(data.frame(Intercept=1, Rain=seq_range(weather$Rain, 14), Year=unique(weather$fYear)),
Rain, Year)
grid_fixdum$pred <- predict(FE_bivariate,grid_fixdum)
ggplot(weather, aes(Rain)) +
geom_point(aes(y = GOPVoteShare, colour=fYear)) +
geom_smooth(aes(y = GOPVoteShare), method='lm', se=FALSE, colour="black")+
geom_line(data=grid_fixdum, aes(x=Rain, y=pred, colour=Year))
In fact, looking only at the effects within
Years, we can see that the aggregate slope is even slightly
negative!
Even with our FE estimates, however, there is an important problem.
Using the lm() command, R estimates the time dummies as
additional variables. This affects the measures of uncertainty and,
therefore, statistical significance as observations within the same year
are not independent. In fact, we want to estimate FE models precisely
because we have reason to believe that they are not independent of each
other.
To adjust for this dependence of observations, we must calculate
robust - or clustered or panel-corrected - standard
errors. This is important and should always be done when
estimating FE models - otherwise standard errors will be (heavily)
biased. In turn, this would mean we make wrong inferences.
Luckily, we can easily adjust standard errors, either manually using
special packages such as the sandwich package or
automatically as part of the regression estimation. To do the latter, we
can use the lm_robust() command from the
estimatr package. Its syntax is similar to the general
lm() command, but it allows you to simply specify fixed
effects by adding a fixed_effects= argument, followed by
the variable indicating the FE. Robust standard errors are automatically
calculated when you use that function.
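The manual route looks like this - a sketch using the sandwich and lmtest packages on the time-FE model estimated with lm() above (HC2 is the same standard error type lm_robust() reports by default):

```r
# Sketch: manually compute heteroskedasticity-robust (HC2) standard
# errors for the lm() time-FE model
library(lmtest)
library(sandwich)

FE_bivariate <- lm(GOPVoteShare ~ Rain + factor(Year), data = weather)

# coeftest() reports the coefficient table using the HC2
# variance-covariance matrix produced by vcovHC()
coeftest(FE_bivariate, vcov = vcovHC(FE_bivariate, type = "HC2"))
```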
Exercise 5: Estimate the same time-FE models as above with
robust standard errors using the lm_robust()
function.
require(estimatr)
# Bivariate time-FE with robust SEs
FE_robust_bivariate <- lm_robust(GOPVoteShare ~ Rain, fixed_effects= factor(Year), data=weather)
summary(FE_robust_bivariate)
Call:
lm_robust(formula = GOPVoteShare ~ Rain, data = weather, fixed_effects = factor(Year))
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
Rain -1.055 0.256 -4.12 3.798e-05 -1.557 -0.553 43325
Multiple R-squared: 0.2621 , Adjusted R-squared: 0.2618
Multiple R-squared (proj. model): 0.0003172 , Adjusted R-squared (proj. model): -5.8e-06
F-statistic (proj. model): 16.97 on 1 and 43325 DF, p-value: 3.798e-05
# Multivariate time-FE with robust SEs
FE_robust_covariates <- lm_robust(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, fixed_effects= factor(Year), data=weather)
summary(FE_robust_covariates)
Call:
lm_robust(formula = GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad +
FarmsPerCap + AdjIncome + Literacy, data = weather, fixed_effects = factor(Year))
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
Rain 0.1143 0.255639 0.4471 6.548e-01 -0.3868 0.6153 43320
PcntBlack -0.2387 0.005784 -41.2799 0.000e+00 -0.2501 -0.2274 43320
ZPcntHSGrad 2.9248 0.092533 31.6078 8.493e-217 2.7434 3.1061 43320
FarmsPerCap 35.7622 1.732838 20.6379 3.562e-94 32.3658 39.1586 43320
AdjIncome 0.1402 0.187772 0.7465 4.553e-01 -0.2279 0.5082 43320
Literacy -2.3340 0.270200 -8.6379 5.918e-18 -2.8636 -1.8044 43320
Multiple R-squared: 0.406 , Adjusted R-squared: 0.4058
Multiple R-squared (proj. model): 0.1953 , Adjusted R-squared (proj. model): 0.195
F-statistic (proj. model): 1343 on 6 and 43320 DF, p-value: < 2.2e-16
Exercise 6: Display the regression output of all six models in a single table. Interpret the models.
require(texreg)
# This is one possible option, there are many others.
screenreg(l=list(ols_bivariate, ols_covariates, FE_bivariate, FE_covariates, FE_robust_bivariate, FE_robust_covariates), include.ci = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), custom.header = list("Pooled OLS" = 1:2, "Time FE" = 3:4, "Robust Time FE" = 5:6),
omit.coef = "(Intercept)")
=================================================================================================
Pooled OLS Time FE Robust Time FE
----------------------- -------------------------- --------------------------
Model 1 Model 2 Model 3 Model 4 Model 5 Model 6
-------------------------------------------------------------------------------------------------
Rain 0.50 1.09 *** -1.05 *** 0.11 -1.05 *** 0.11
(0.30) (0.28) (0.28) (0.26) (0.26) (0.26)
PcntBlack -0.23 *** -0.24 *** -0.24 ***
(0.01) (0.00) (0.01)
ZPcntHSGrad 2.02 *** 2.92 *** 2.92 ***
(0.09) (0.09) (0.09)
FarmsPerCap 43.41 *** 35.76 *** 35.76 ***
(1.97) (1.72) (1.73)
AdjIncome 3.01 *** 0.14 0.14
(0.14) (0.19) (0.19)
Literacy -3.97 *** -2.33 *** -2.33 ***
(0.24) (0.21) (0.27)
factor(Year)1952 15.95 *** 16.10 ***
(0.34) (0.31)
factor(Year)1956 14.44 *** 15.08 ***
(0.34) (0.31)
factor(Year)1960 11.40 *** 12.70 ***
(0.34) (0.32)
factor(Year)1964 4.96 *** 5.93 ***
(0.34) (0.35)
factor(Year)1968 4.60 *** 4.93 ***
(0.34) (0.37)
factor(Year)1972 27.46 *** 26.97 ***
(0.34) (0.42)
factor(Year)1976 7.17 *** 7.24 ***
(0.34) (0.42)
factor(Year)1980 14.36 *** 14.70 ***
(0.34) (0.39)
factor(Year)1984 23.22 *** 23.48 ***
(0.34) (0.39)
factor(Year)1988 16.72 *** 16.91 ***
(0.34) (0.41)
factor(Year)1992 0.77 * 0.95 *
(0.34) (0.39)
factor(Year)1996 5.54 *** 6.04 ***
(0.34) (0.39)
factor(Year)2000 17.95 *** 18.32 ***
(0.34) (0.38)
-------------------------------------------------------------------------------------------------
R^2 0.00 0.16 0.26 0.41 0.26 0.41
Adj. R^2 0.00 0.16 0.26 0.41 0.26 0.41
Num. obs. 43340 43340 43340 43340 43340 43340
RMSE 13.36 11.99
=================================================================================================
*** p < 0.001; ** p < 0.01; * p < 0.05
## drop or amend the 'include.ci' option to display CIs instead of SEs - which is default for lm_robust
Note that the default display of lm_robust() differs
from the standard OLS output we have encountered so far. Instead of standard
errors, the table indicates the 95% confidence intervals of the
estimates - this is indeed an increasingly popular way to present
results, as confidence intervals give a clearer indication of the
effect’s uncertainty. Also, all significant results are marked by only
one asterisk, which indicates that the value under the null hypothesis
(i.e. 0) falls outside the confidence interval.
Pay attention to the significant differences between the OLS models
and the FE models. Based on the pooled OLS model, we would have thought
that rain is associated with an increase in GOP vote share
once we control for covariates. However, doing so does not account for
temporal changes and is based on comparisons across different elections.
In other words, a pooled OLS compares observations from
different points in time and from different units to each other.
In contrast, the FE models estimate no association between
Rain (or AdjIncome) and
GOPVoteShare once we include a list of (time-variant)
covariates in our model. Note that in the FE models we are estimating
effects of Rain within each individual election year and do
not simply compare observations from different election years
to each other.
Rather than using lm() or lm_robust(), we can
use the plm() function from the plm package,
which allows us to conduct unit and time fixed effects estimation. The syntax is quite
simple:
plm(formula, effect = "", model = "", index = c("unit", "year"), data = data).
The effect argument sets what type of model you want to
run; some of the options are "individual", which is unit-fixed effects,
"time" for time-fixed effects, and "twoways" for two-way fixed effects.
The model argument determines the estimation method. If you
want to calculate the coefficients by demeaning both dependent and
independent variables, and then calculate estimates using deviations
from the mean, you need to set the model argument equal to
"within". What essentially happens with the "within"
estimation is that it subtracts group-specific mean values from the
dependent variable and the explanatory variables, which removes
unit-specific or time-specific effects. If you want to estimate the
coefficients using the first-difference estimator, you can set the
model argument equal to "fd". The
index argument tells the function the structure of
the data, i.e. which variable contains the units and which variable
contains the time periods, so we need to tell the function the
entity identifier and the time identifier.
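To see what the "within" transformation does, here is a sketch with hypothetical toy data showing that demeaning by group and running OLS on the deviations recovers the same slope as a dummy-variable regression:

```r
# Sketch with simulated data: the "within" (demeaning) estimator
# gives the same slope as a unit-dummy regression
set.seed(1)
toy <- data.frame(unit = rep(1:3, each = 10),
                  x    = rnorm(30))
toy$y <- 2 * toy$x + toy$unit + rnorm(30)  # unit-specific intercepts

# Demean y and x within each unit, then run OLS on the deviations
toy$y_dm <- toy$y - ave(toy$y, toy$unit)
toy$x_dm <- toy$x - ave(toy$x, toy$unit)

coef(lm(y_dm ~ x_dm - 1, data = toy))           # within estimate
coef(lm(y ~ x + factor(unit), data = toy))["x"] # dummy-variable estimate
# The two slope estimates coincide
```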
Now, let’s estimate fixed-effects models using the
plm() function. Let’s start by conducting a time FE model for
the Turnout variable with each of the two “treatment” variables,
Rain_Dev and Snow_Dev. Let’s conduct the
simplest version without covariates.
Exercise 7: Run a time FE model for both weather variables
(Rain_Dev and Snow_Dev in separate models),
setting the effect argument equal to "time". Set the model argument equal
to "within" and the index argument equal to
c("FIPS_County", "Year"). Store the output of your time FE
model in an object and then use summary() to check
your results. Interpret the results of both
estimations.
turnout_time_plm_rain <- plm(Turnout ~ Rain_Dev, effect = "time", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_time_plm_rain)
Oneway (time) effect Within Model
Call:
plm(formula = Turnout ~ Rain_Dev, data = weather, effect = "time",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-59.7566 -8.8644 1.1638 9.8351 51.2478
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Rain_Dev 2.1607 0.3053 7.0772 1.493e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 8575500
Residual Sum of Squares: 8565600
R-Squared: 0.0011547
Adj. R-Squared: 0.00083197
F-statistic: 50.0866 on 1 and 43325 DF, p-value: 1.4934e-12
turnout_time_plm_snow <- plm(Turnout ~ Snow_Dev, effect = "time", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_time_plm_snow)
Oneway (time) effect Within Model
Call:
plm(formula = Turnout ~ Snow_Dev, data = weather, effect = "time",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-59.8279 -8.8736 1.1604 9.8141 50.9186
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Snow_Dev 1.0105 0.2592 3.8984 9.697e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 8575500
Residual Sum of Squares: 8572400
R-Squared: 0.00035066
Adj. R-Squared: 2.7633e-05
F-statistic: 15.1976 on 1 and 43325 DF, p-value: 9.697e-05
We find that a one-unit increase in the deviation of rainfall from “normal”
election-day weather is associated with a 2.16 percentage point increase in
turnout, once we include time-fixed effects. The effect of snowfall
seems to be smaller, at 1.0105. Note that the results of
these bivariate models are in the opposite direction of what the authors
found in their study. Let’s do the same, but now let’s look at the vote share
for Republicans.
Exercise 8: Run a time FE model using the plm()
function, but look at the effect of rainfall and snowfall on the vote
share for Republicans (GOPVoteShare). Remember to set the
effect argument equal to "time" and the
model argument equal to "within". Store the
output from the plm() function in an object. Use the
summary() command to check the results. Interpret the
coefficients of the Rain_Dev and Snow_Dev
“treatment” variables. Do above-average rainfall and snowfall increase
or decrease GOP vote share? Are the results statistically
significant?
gop_time_plm_rain <- plm(GOPVoteShare ~ Rain_Dev, effect = "time", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(gop_time_plm_rain)
Oneway (time) effect Within Model
Call:
plm(formula = GOPVoteShare ~ Rain_Dev, data = weather, effect = "time",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-51.6679 -7.7675 0.8605 8.7976 52.4383
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Rain_Dev 1.93450 0.28998 6.6712 2.567e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 7735200
Residual Sum of Squares: 7727300
R-Squared: 0.0010262
Adj. R-Squared: 0.00070339
F-statistic: 44.5055 on 1 and 43325 DF, p-value: 2.5669e-11
gop_time_plm_snow <- plm(GOPVoteShare ~ Snow_Dev, effect = "time", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(gop_time_plm_snow)
Oneway (time) effect Within Model
Call:
plm(formula = GOPVoteShare ~ Snow_Dev, data = weather, effect = "time",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-52.10854 -7.77081 0.85456 8.81168 52.38089
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Snow_Dev 0.017652 0.246213 0.0717 0.9428
Total Sum of Squares: 7735200
Residual Sum of Squares: 7735200
R-Squared: 1.1864e-07
Adj. R-Squared: -0.00032302
F-statistic: 0.00514001 on 1 and 43325 DF, p-value: 0.94285
We find that a one-unit increase in the deviation of rainfall from the election-day average is associated with a 1.93 percentage-point increase in vote share for the Republicans, once we control for time FE. Meanwhile, a one-unit increase in the deviation from the average snowfall is associated with a negligible 0.02 increase in vote share for Republicans. However, this last result is not statistically significant. These results are more in line with what the authors found in their study.
With the plm() function we can also fit unit-FE models. All
we need to do is set the effect argument equal to
"individual". Let’s do that:
Exercise 9: Run a unit FE model looking at the effect of
weather on voter turnout (Turnout) and vote share for
Republicans (GOPVoteShare). Use both measures of weather,
Rain_Dev and Snow_Dev. Store the output of
your panel data estimations into objects and use the
summary() function to check the results.
# Turnout - rain
turnout_unit_plm_rain <- plm(Turnout ~ Rain_Dev, effect = "individual", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_unit_plm_rain)
Oneway (individual) effect Within Model
Call:
plm(formula = Turnout ~ Rain_Dev, data = weather, effect = "individual",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-43.06166 -4.68951 0.27929 5.33676 51.51613
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Rain_Dev -2.68231 0.17749 -15.113 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 3015400
Residual Sum of Squares: 2998400
R-Squared: 0.005646
Adj. R-Squared: -0.071358
F-statistic: 228.392 on 1 and 40224 DF, p-value: < 2.22e-16
# Turnout - snow
turnout_unit_plm_snow <- plm(Turnout ~ Snow_Dev, effect = "individual", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_unit_plm_snow)
Oneway (individual) effect Within Model
Call:
plm(formula = Turnout ~ Snow_Dev, data = weather, effect = "individual",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-44.42899 -4.72004 0.29016 5.32023 51.71578
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Snow_Dev -0.89117 0.16242 -5.4869 4.116e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 3015400
Residual Sum of Squares: 3013100
R-Squared: 0.00074789
Adj. R-Squared: -0.076636
F-statistic: 30.1056 on 1 and 40224 DF, p-value: 4.1161e-08
# GOP vote share - rain
gop_unit_plm_rain <- plm(GOPVoteShare ~ Rain_Dev, effect = "individual", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(gop_unit_plm_rain)
Oneway (individual) effect Within Model
Call:
plm(formula = GOPVoteShare ~ Rain_Dev, data = weather, effect = "individual",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-59.21455 -7.53955 0.63525 7.74417 69.37792
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Rain_Dev 4.96120 0.26175 18.954 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 6579300
Residual Sum of Squares: 6521100
R-Squared: 0.0088523
Adj. R-Squared: -0.067904
F-statistic: 359.253 on 1 and 40224 DF, p-value: < 2.22e-16
# GOP vote share - snow
gop_unit_plm_snow <- plm(GOPVoteShare ~ Snow_Dev, effect = "individual", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(gop_unit_plm_snow)
Oneway (individual) effect Within Model
Call:
plm(formula = GOPVoteShare ~ Snow_Dev, data = weather, effect = "individual",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-56.68786 -7.68837 0.51442 7.69538 68.74177
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Snow_Dev -1.36523 0.23991 -5.6906 1.274e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 6579300
Residual Sum of Squares: 6574100
R-Squared: 0.00080443
Adj. R-Squared: -0.076575
F-statistic: 32.3834 on 1 and 40224 DF, p-value: 1.2744e-08
Using county-level FE, we find results that are in line with the findings reported by Gomez et al. A one-unit increase in the deviation of rainfall from the election-day average is associated with a 2.68 percentage-point decrease in voter turnout. We find a similar, smaller effect for snow (-0.89). We also find that relatively bad weather favours Republicans, with a sizeable 4.96 percentage-point increase in their vote share.
Now that we have conducted time and unit FE separately, we can
run a two-way FE model that controls for unit and time FE at the
same time. In a two-way FE model, we are controlling for all
unit-specific time-invariant confounders and for year-specific confounders
that affect all units at the same time. Doing this with
plm() is very simple: we only need to set the
effect argument equal to
"twoways".
Exercise 10: Conduct a two-way FE model looking at the effect
of weather on voter turnout and GOP vote share. Use both measures of
weather, Rain_Dev and Snow_Dev. Set the
effect argument equal to "twoways". Store the
outputs into objects and then use summary() to check
the results.
# two-way Turnout - rain
turnout_twoway_plm_rain <- plm(Turnout ~ Rain_Dev, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_twoway_plm_rain)
Twoways effects Within Model
Call:
plm(formula = Turnout ~ Rain_Dev, data = weather, effect = "twoways",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-42.28722 -4.42452 -0.02928 4.80940 49.96074
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Rain_Dev -1.08923 0.17878 -6.0926 1.121e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 2516900
Residual Sum of Squares: 2514600
R-Squared: 0.00092226
Adj. R-Squared: -0.076796
F-statistic: 37.1194 on 1 and 40211 DF, p-value: 1.1212e-09
# two-way Turnout - snow
turnout_twoway_plm_snow <- plm(Turnout ~ Snow_Dev, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_twoway_plm_snow)
Twoways effects Within Model
Call:
plm(formula = Turnout ~ Snow_Dev, data = weather, effect = "twoways",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-42.128610 -4.425468 -0.023033 4.794694 50.905661
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Snow_Dev -0.83368 0.15250 -5.4667 4.613e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 2516900
Residual Sum of Squares: 2515000
R-Squared: 0.00074264
Adj. R-Squared: -0.076989
F-statistic: 29.8845 on 1 and 40211 DF, p-value: 4.6127e-08
# two-way GOP vote share - rain
gop_twoway_plm_rain <- plm(GOPVoteShare ~ Rain_Dev, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(gop_twoway_plm_rain)
Twoways effects Within Model
Call:
plm(formula = GOPVoteShare ~ Rain_Dev, data = weather, effect = "twoways",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-48.96486 -5.15510 -0.03426 5.35938 75.59512
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Rain_Dev 2.4717 0.2205 11.209 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 3837100
Residual Sum of Squares: 3825100
R-Squared: 0.0031151
Adj. R-Squared: -0.074432
F-statistic: 125.651 on 1 and 40211 DF, p-value: < 2.22e-16
# two-way GOP vote share - snow
gop_twoway_plm_snow <- plm(GOPVoteShare ~ Snow_Dev, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(gop_twoway_plm_snow)
Twoways effects Within Model
Call:
plm(formula = GOPVoteShare ~ Snow_Dev, data = weather, effect = "twoways",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-48.9149638 -5.1694003 0.0032289 5.3506297 75.5093811
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Snow_Dev -0.15257 0.18836 -0.81 0.418
Total Sum of Squares: 3837100
Residual Sum of Squares: 3837000
R-Squared: 1.6315e-05
Adj. R-Squared: -0.077772
F-statistic: 0.656056 on 1 and 40211 DF, p-value: 0.41796
We again find evidence that bad weather is associated with a decrease in turnout and an increase in the GOP vote share once we include time and unit fixed effects. With unit fixed effects we are controlling for all time-invariant confounders; with time fixed effects we are controlling for a particular type of time-specific confounder, namely those that affect all units in the same period. Thus, we still need to control for time-varying unit-specific confounders, such as average income or unemployment rates at the county level, which differ from one county to another and also vary from one presidential cycle to another.
Exercise 11: Run a two-way FE model looking at the effect of
weather only on Turnout. The syntax is the same as for
lm(): you can include additional variables in your model
by adding the variable names separated by + signs. Include
the following covariates: PcntBlack,
ZPcntHSGrad, FarmsPerCap,
AdjIncome and Literacy. Store the output from
your plm() estimations into separate objects and then use
the summary() function to check the results. Again, conduct
separate estimations: one using the Rain_Dev variable and
another using the Snow_Dev variable.
# turnout - rain + covariates
turnout_twoway_plm_rain_cov <- plm(Turnout ~ Rain_Dev + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_twoway_plm_rain_cov)
Twoways effects Within Model
Call:
plm(formula = Turnout ~ Rain_Dev + PcntBlack + ZPcntHSGrad +
FarmsPerCap + AdjIncome + Literacy, data = weather, effect = "twoways",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-36.650721 -4.040156 -0.034597 4.124277 53.489140
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Rain_Dev -1.231709 0.159845 -7.7056 1.332e-14 ***
PcntBlack -1.012833 0.013586 -74.5515 < 2.2e-16 ***
ZPcntHSGrad -0.173248 0.118449 -1.4626 0.1436
FarmsPerCap -1.681896 2.806345 -0.5993 0.5490
AdjIncome -1.014550 0.201517 -5.0346 4.810e-07 ***
Literacy -7.416846 0.156565 -47.3723 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 2516900
Residual Sum of Squares: 2005400
R-Squared: 0.20324
Adj. R-Squared: 0.14115
F-statistic: 1709.31 on 6 and 40206 DF, p-value: < 2.22e-16
# turnout - snow + covariates
turnout_twoway_plm_snow_cov <- plm(Turnout ~ Snow_Dev + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(turnout_twoway_plm_snow_cov)
Twoways effects Within Model
Call:
plm(formula = Turnout ~ Snow_Dev + PcntBlack + ZPcntHSGrad +
FarmsPerCap + AdjIncome + Literacy, data = weather, effect = "twoways",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-36.816275 -4.045457 -0.045859 4.128219 53.437952
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Snow_Dev -0.656375 0.136656 -4.8031 1.568e-06 ***
PcntBlack -1.012845 0.013595 -74.5008 < 2.2e-16 ***
ZPcntHSGrad -0.184131 0.118495 -1.5539 0.1202
FarmsPerCap -1.951811 2.808382 -0.6950 0.4871
AdjIncome -1.066356 0.201714 -5.2865 1.254e-07 ***
Literacy -7.397950 0.156696 -47.2122 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 2516900
Residual Sum of Squares: 2007200
R-Squared: 0.20252
Adj. R-Squared: 0.14038
F-statistic: 1701.73 on 6 and 40206 DF, p-value: < 2.22e-16
Now let’s report the FE models for the turnout
outcome variable (Turnout) in a single regression table. Let’s also
report only the rainfall models (otherwise we would have 14
models). Let’s use screenreg() to compare the
results. You can store all of the outputs in a single
list() and give them names, as you can see below.
plm_models <- list("Time fixed effect" = turnout_time_plm_rain,
"Unit fixed effect" = turnout_unit_plm_rain,
"Twoway" = turnout_twoway_plm_rain,
"Twoway + cov" = turnout_twoway_plm_rain_cov)
Exercise 12: Report in a regression table the following
models: the time FE, unit FE, two-way, and two-way + covariates models for the
rainfall variable (Rain_Dev) and only the voter turnout
(Turnout) outcome. Use the screenreg()
function.
screenreg(plm_models,
stars = 0.05,
digits = 2,
custom.coef.names = c("Rain", "% Black ", "Graduates", "Rural", "Income", "Literacy"),
reorder.coef = c(1, 3, 4, 5, 2 ,6),
caption = "Fixed effects models - Voter turnout",
caption.above = TRUE)
=========================================================================
Time fixed effect Unit fixed effect Twoway Twoway + cov
-------------------------------------------------------------------------
Rain 2.16 * -2.68 * -1.09 * -1.23 *
(0.31) (0.18) (0.18) (0.16)
Graduates -0.17
(0.12)
Rural -1.68
(2.81)
Income -1.01 *
(0.20)
% Black -1.01 *
(0.01)
Literacy -7.42 *
(0.16)
-------------------------------------------------------------------------
R^2 0.00 0.01 0.00 0.20
Adj. R^2 0.00 -0.07 -0.08 0.14
Num. obs. 43340 43340 43340 43340
=========================================================================
* p < 0.05
One limitation of the plm() function is that it
calculates clustered standard errors without correcting for small
samples, leaving a potential downward bias. The data that we are using
are far from small, but when you are dealing with small panels, you may want
to apply this small-sample correction. For that, we can use the
coeftest() and vcovHC() functions to obtain
cluster-robust standard errors and draw inference based on
them.
The syntax is a little tricky, so let’s look at the code below
from the inside out. The inner function is vcovHC(), which
calculates heteroskedasticity-consistent
standard errors. Once we have calculated our standard
errors, we can perform a z/quasi-t Wald test with
coeftest(), using the standard errors estimated by
vcovHC(). The first argument of the
vcovHC() function is the object where you stored the output
from your fixed-effects model from plm(). The second
argument is type. This is the type of small-sample
adjustment that we want to make. “HC1” is the commonly used correction
for small samples, so we can set type equal to
"HC1". Finally, the third argument, cluster,
determines whether we want to obtain
the variance-covariance matrix clustered by unit
("group") or by time ("time"). You can
set both with
cluster = c("group", "time"). The outer function is
coeftest(). The first argument of this function is the
object where you stored your output from plm(). The second
argument, vcov, is the variance-covariance matrix of the
estimated coefficients. In this case, rather than using the non-robust
variance-covariance matrix from the plm() function, we will
use the White heteroskedasticity-consistent covariance matrix that we
obtained from the vcovHC() function. Using this matrix also
allows us to account for serial correlation, cross-sectional dependence,
and heteroskedasticity across groups and time. Below you can see the syntax
for the two-way fixed-effects model where we looked at the effect of
rainfall on voter turnout, including covariates.
out_rain_cov <- coeftest(turnout_twoway_plm_rain_cov, vcov = vcovHC(turnout_twoway_plm_rain_cov, type = "HC1", cluster = c("group", "time")))
out_rain_cov
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
Rain_Dev -1.231709 0.170238 -7.2352 4.73e-13 ***
PcntBlack -1.012833 0.054346 -18.6367 < 2.2e-16 ***
ZPcntHSGrad -0.173248 0.279964 -0.6188 0.53604
FarmsPerCap -1.681896 7.182119 -0.2342 0.81485
AdjIncome -1.014550 0.480099 -2.1132 0.03459 *
Literacy -7.416846 0.454226 -16.3285 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It’s important to separate estimation from inference. First we obtained
our point estimate, which equals -1.23; but when we want
to draw inference from this point estimate, we need to make the
appropriate adjustments to yield accurate standard errors. The
“unadjusted” standard error for the rainfall variable from
plm() was 0.160, but once we apply the small-sample bias
correction, it changes to 0.170 - a roughly 6.5% increase, which is not
that much. Presumably the standard errors did not change substantially
because we are working with a relatively large sample. In other cases, the
change might well have a meaningful impact on the interpretation of your
findings.
Now obtain the estimates of the two-way fixed effects model of snowfall on voter turnout and adjust your standard errors.
Exercise 13: Perform the small sample correction to the
standard errors obtained from the two-way FE with covariates, do this
for the model that looks at the effect of snowfall
(Snow_Dev) on voter turnout (Turnout). Compare
the standard errors to the “unadjusted” standard errors obtained from
plm().
summary(turnout_twoway_plm_snow_cov)
Twoways effects Within Model
Call:
plm(formula = Turnout ~ Snow_Dev + PcntBlack + ZPcntHSGrad +
FarmsPerCap + AdjIncome + Literacy, data = weather, effect = "twoways",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-36.816275 -4.045457 -0.045859 4.128219 53.437952
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Snow_Dev -0.656375 0.136656 -4.8031 1.568e-06 ***
PcntBlack -1.012845 0.013595 -74.5008 < 2.2e-16 ***
ZPcntHSGrad -0.184131 0.118495 -1.5539 0.1202
FarmsPerCap -1.951811 2.808382 -0.6950 0.4871
AdjIncome -1.066356 0.201714 -5.2865 1.254e-07 ***
Literacy -7.397950 0.156696 -47.2122 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 2516900
Residual Sum of Squares: 2007200
R-Squared: 0.20252
Adj. R-Squared: 0.14038
F-statistic: 1701.73 on 6 and 40206 DF, p-value: < 2.22e-16
out_snow_cov <- coeftest(turnout_twoway_plm_snow_cov, vcov = vcovHC(turnout_twoway_plm_snow_cov, type = "HC1", cluster = c("group", "time")))
out_snow_cov
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
Snow_Dev -0.656375 0.167908 -3.9091 9.278e-05 ***
PcntBlack -1.012845 0.054276 -18.6611 < 2.2e-16 ***
ZPcntHSGrad -0.184131 0.280286 -0.6569 0.51122
FarmsPerCap -1.951811 7.199036 -0.2711 0.78630
AdjIncome -1.066356 0.481178 -2.2161 0.02669 *
Literacy -7.397950 0.454405 -16.2805 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that without the small-sample adjustment the standard error was 0.137, but once we applied the correction we got a slightly larger standard error (0.168). They don’t seem to change much, but this is actually a 23% increase, which may radically change whether your results are statistically significant or not!
So far we have looked at fixed effects using the lm()
and plm() functions, assuming that a fixed
effects model is what we should run. But which model should we go with? Let’s now test whether
time-FE or random effects is the more appropriate model. Recall the
differences between the two estimators.
The fixed effects estimator - or within
estimator - accounts for unobserved heterogeneity that is correlated
with the independent covariates. In this model, the group means are
fixed, with each group having a separate intercept - recall the
year dummies in the models above. For the time-fixed effects in our
example, this means we are essentially looking at the data demeaned
within each time period. In other words, we are looking at the variation
across units within a certain year - i.e. for each presidential
election. Put differently, we are estimating our coefficient while holding time
fixed.
\[FE: Y_{i,t} = \alpha_i + \beta_1 X_{i,t} + u_{i,t}\]
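To make the within logic concrete, here is a minimal sketch in base R. It uses simulated data (not the weather dataset; all names here are illustrative) to show that demeaning by unit yields exactly the same slope as the LSDV approach of including a dummy variable for every unit:

```r
# Minimal sketch (simulated data): the within (demeaning) transformation
# yields the same slope as the LSDV regression with a dummy per unit.
set.seed(7)
n <- 50; t <- 4
id    <- rep(1:n, each = t)
alpha <- rep(rnorm(n), each = t)   # unit fixed effects
x     <- rnorm(n * t)
y     <- alpha + 2 * x + rnorm(n * t)

# LSDV: include a dummy (factor level) for every unit
lsdv <- coef(lm(y ~ x + factor(id)))["x"]

# Within: demean y and x by unit, then regress without an intercept
y_dm   <- y - ave(y, id)
x_dm   <- x - ave(x, id)
within <- coef(lm(y_dm ~ 0 + x_dm))["x_dm"]

all.equal(unname(lsdv), unname(within))  # TRUE
```

This equivalence is what plm() exploits with model = "within": it never has to estimate thousands of dummy coefficients.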
In the case of a random effects model, we assume
that \(\alpha_i\) has zero mean. In
other words, instead of treating the group means as fixed, we treat them as
a random sample from a population. While the fixed effects model ‘gets
rid’ of unobserved time-invariant heterogeneity (\(\alpha_i\)), the random effects model makes
the assumption that the unobserved effect is uncorrelated with the
explanatory variables. If this assumption holds, the random effects
estimator is more efficient, and it allows us to estimate the effects of
time-invariant covariates. In essence, the random effects estimator is a
weighted average of the within estimator and a
between estimator that looks at differences between
units.
\[RE: Y_{i,t} = \alpha_i + \beta_0 + \beta_1 X_{i,t,1} + \beta_2 X_{i,t,2} + u_{i,t}\]
We can use two different tests to check which model is more appropriate
for our data. First, the Hausman test
(sometimes also called the Durbin–Wu–Hausman test) analyses the
difference between the coefficient vectors of the two models under
the null hypothesis that both models’ estimates are consistent. If we
fail to reject the null, we choose the random effects
model, as it is more efficient than the fixed effects model.
However, the random effects model is inconsistent if there indeed is
unobserved heterogeneity correlated with the covariates due to omitted
variables, in which case the FE estimator should always be
preferred.
An F-test looks at whether any additional effects - such as
one-way or two-way effects - are predictive of the outcome. That is, it
tests the effects introduced in a within model against another
model, such as a random effects model. If the effects of the
within estimator are not significant, we can, again, use the random
effects model to increase the efficiency of our estimators. If this is
not the case, we go with FE, which is always unbiased with regard to
unobserved heterogeneity that is constant over time.
Let’s now conduct the tests. To do so, we first specify
fixed-effects and random effects models using the
plm package. The syntax is essentially the same as for
lm(). However, you need to add the model
argument as well as the index argument to specify the time
variable (for fixed effects) and the unit and time variables (for random
effects).
Exercise 14: Specify a time-fixed effects and a random effects
model using plm(). Run this panel data estimation for the
GOP vote share outcome. Include the same covariates that we included
before. Store the outputs from these models in different
objects.
require(plm)
# Time-FE.
FE_time <- plm(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, model= "within", index="Year", data=weather)
#Note: The identical model can be specified by explicitly indicating time-FE:
#FE_time2 <- plm(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, model= "within", effect="time", index=c("FIPS_County","Year"), data=weather)
# Random Effects
random <- plm(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, model= "random", index=c("FIPS_County", "Year"), data=weather)
Exercise 15: Conduct both a Hausman test and
F-test using the models you just specified. Interpret your
results.
You can use the phtest() and pFtest() commands
for the Hausman and F-tests, respectively. The syntax is simple and similar
for both: command(Model1, Model2).
phtest(FE_time, random)
Hausman Test
data: GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + ...
chisq = 1689, df = 6, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent
pFtest(FE_time, random)
F test for individual effects
data: GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + ...
F = 205.47, df1 = 13, df2 = 43320, p-value < 2.2e-16
alternative hypothesis: significant effects
What do the tests tell us? In the case of the Hausman test, we can reject the null hypothesis of consistency and conclude that the random effects estimates are inconsistent. The F-test shows that we can reject the null hypothesis of no significant effects in the fixed-effects model: the FE model indeed adds significant effects. So, both tests caution us against using the random effects model and point us towards the FE estimator. Both tests giving the same indication is the best outcome, as we can be fairly sure about the need to choose the FE model. In some cases, however, you might get two different indications - in which case it is usually a good idea to choose the conservative option: the unbiased but (possibly) inefficient FE estimator.
Let’s now finally look at another estimator that can be of use when working with panel data. The First Difference (FD) Estimator looks at the differences between time periods. As in the FE estimator, time-invariant unit-specific covariates are dropped from the model, as it only considers changes from one time period to the next:
\[FD: \Delta Y_i = \delta + \beta_1 \Delta X_{i1} + \beta_2 \Delta X_{i2} + \Delta v_i\]
You might have noticed a similarity to the FE estimator. While the FE estimator demeans over units or times, the FD estimator considers differences over time periods. Accordingly, the first difference estimator is identical to the FE estimator if the number of time periods is two: Whenever this is the case, the differencing of the FD estimator and the demeaning of the FE estimator are equivalent.
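We can verify this equivalence with a quick sketch in base R (again on simulated data with illustrative names, not the weather dataset). With exactly two periods, differencing and demeaning produce the same slope; the FD regression below omits the period intercept so that both estimators use exactly the same regressors:

```r
# Minimal sketch (simulated data): with exactly two time periods, the
# first-difference and within (FE) estimators produce the same slope.
set.seed(42)
n  <- 200
alpha <- rnorm(n)                  # unit fixed effects
x1 <- rnorm(n); x2 <- rnorm(n)     # covariate in periods 1 and 2
y1 <- alpha + 1.5 * x1 + rnorm(n)
y2 <- alpha + 1.5 * x2 + rnorm(n)

# FD: regress the change in y on the change in x
fd <- coef(lm(I(y2 - y1) ~ 0 + I(x2 - x1)))[1]

# Within: demean by unit (the unit mean is just the two-period average)
ybar <- (y1 + y2) / 2; xbar <- (x1 + x2) / 2
fe <- coef(lm(c(y1 - ybar, y2 - ybar) ~ 0 + c(x1 - xbar, x2 - xbar)))[1]

all.equal(unname(fd), unname(fe))  # TRUE
```

With more than two periods the two estimators generally differ, which is why the choice between FD and FE matters in longer panels.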
Exercise 16: Estimate the first difference estimator using
the same model specification as in Exercise 6.
Make sure to use the plm package, to specify the FD
estimator with the argument model="fd", and to specify the
unit and time variables using the index argument.
fd_model <- plm(GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy,data=weather,effect="individual",model="fd",index=c("FIPS_County","Year"))
summary(fd_model)
Oneway (individual) effect First-Difference Model
Call:
plm(formula = GOPVoteShare ~ Rain + PcntBlack + ZPcntHSGrad +
FarmsPerCap + AdjIncome + Literacy, data = weather, effect = "individual",
model = "fd", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Observations used in estimation: 40225
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-102.8868 -9.8854 2.0737 9.5852 82.4700
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 0.688633 0.094654 7.2753 3.52e-13 ***
Rain 8.279615 0.241035 34.3503 < 2.2e-16 ***
PcntBlack -1.779539 0.081070 -21.9507 < 2.2e-16 ***
ZPcntHSGrad -1.085270 0.558694 -1.9425 0.052082 .
FarmsPerCap 100.925670 8.221362 12.2760 < 2.2e-16 ***
AdjIncome 8.482942 0.426965 19.8680 < 2.2e-16 ***
Literacy 1.547906 0.558677 2.7707 0.005597 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 11471000
Residual Sum of Squares: 10767000
R-Squared: 0.061342
Adj. R-Squared: 0.061202
F-statistic: 438.048 on 6 and 40218 DF, p-value: < 2.22e-16
The output of the FD estimator indicates the association between a
change in an independent variable from one time period to
the next and the change in the outcome for the same unit. That is, if
Rain in a county increases by one inch from one election to
the next, Republican vote share is estimated to be about 8.3 percentage
points higher on average. Note that the FD estimator does not consider longer time
effects, i.e. those that persist beyond the following time
period. This can be useful if we have strong reasons to believe this
is what happens; otherwise the FE estimator is generally the more
efficient choice.
Now let’s go back to estimating the effect of weather on voter turnout.
We can also look at the extensive and intensive margins of weather’s
effect on voter turnout. The extensive margin refers to a discrete change in
whether an activity or intervention takes place at all - for example, going
from normal election-day weather to a day that is worse than average.
Let’s dichotomise the Rain_Dev and Snow_Dev
variables, setting them to 1 for any day on which the weather is worse than normal
(Rain_Dev > 0 or Snow_Dev > 0).
Meanwhile, the intensive margin refers to the degree (intensity) to which a
resource is utilised - more broadly, how strong the intervention
is - which is basically what we measured before.
Homework 1: Create dummy variables for above-average rainfall and snowfall.
weather <- weather %>%
mutate(dummy_rainfall = ifelse(Rain_Dev > 0, 1, 0),
dummy_snowfall = ifelse(Snow_Dev > 0, 1, 0))
# 9616 county-elections with rainfall above the election-day county average
table(weather$dummy_rainfall)
0 1
33724 9616
# 2439 county-elections with snowfall above the election-day county average
table(weather$dummy_snowfall)
0 1
40901 2439
Homework 2: Use plm() to look at the extensive
margin of weather’s effect on voter turnout.
Include the same covariates that we used in the previous exercise. Set
the model argument equal to "within" and the
effect argument equal to "twoways" so we
get two-way FE estimates.
extensive_rain_cov <- plm(Turnout ~ dummy_rainfall + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(extensive_rain_cov)
Twoways effects Within Model
Call:
plm(formula = Turnout ~ dummy_rainfall + PcntBlack + ZPcntHSGrad +
FarmsPerCap + AdjIncome + Literacy, data = weather, effect = "twoways",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-36.126165 -4.047052 -0.040187 4.135032 53.567761
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
dummy_rainfall -1.061556 0.099175 -10.7039 < 2.2e-16 ***
PcntBlack -1.012718 0.013575 -74.6002 < 2.2e-16 ***
ZPcntHSGrad -0.196365 0.118366 -1.6590 0.09713 .
FarmsPerCap -1.044779 2.805650 -0.3724 0.70961
AdjIncome -0.989087 0.201407 -4.9109 9.102e-07 ***
Literacy -7.411435 0.156459 -47.3698 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 2516900
Residual Sum of Squares: 2002600
R-Squared: 0.20433
Adj. R-Squared: 0.14233
F-statistic: 1720.84 on 6 and 40206 DF, p-value: < 2.22e-16
extensive_snow_cov <- plm(Turnout ~ dummy_snowfall + PcntBlack + ZPcntHSGrad + FarmsPerCap + AdjIncome + Literacy, effect = "twoways", model = "within", index = c("FIPS_County", "Year"), data = weather)
summary(extensive_snow_cov)
Twoways effects Within Model
Call:
plm(formula = Turnout ~ dummy_snowfall + PcntBlack + ZPcntHSGrad +
FarmsPerCap + AdjIncome + Literacy, data = weather, effect = "twoways",
model = "within", index = c("FIPS_County", "Year"))
Unbalanced Panel: n = 3115, T = 1-14, N = 43340
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-36.817656 -4.046146 -0.042443 4.129566 53.406998
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
dummy_snowfall -0.810322 0.165160 -4.9063 9.318e-07 ***
PcntBlack -1.012637 0.013594 -74.4936 < 2.2e-16 ***
ZPcntHSGrad -0.190999 0.118502 -1.6118 0.107
FarmsPerCap -2.051177 2.807308 -0.7307 0.465
AdjIncome -1.062907 0.201684 -5.2702 1.370e-07 ***
Literacy -7.408933 0.156646 -47.2972 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 2516900
Residual Sum of Squares: 2007100
R-Squared: 0.20254
Adj. R-Squared: 0.1404
F-statistic: 1701.94 on 6 and 40206 DF, p-value: < 2.22e-16